Author : Indumathi Pandiyan

Project on Sequential NLP (Module 2) submitted for PGP-AIML Great Learning on 17-July-2022

Part A - 30 Marks

DOMAIN: Digital content and entertainment industry

CONTEXT: The objective of this project is to build a text classification model that analyses customer sentiment based on reviews in the IMDB database. The model uses a deep learning architecture with an embedding layer followed by a classification layer to analyse the sentiment of the customers.

DATA DESCRIPTION: The dataset consists of 50,000 movie reviews from IMDB, labelled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by their frequency in the dataset, so the word that has index 1 is the most frequent word. Use the first 20 words from each review to speed up training, with a maximum vocabulary size of 10,000. As a convention, "0" does not stand for a specific word; it is used to encode any unknown word.

PROJECT OBJECTIVE: Build a sequential NLP classifier that uses input text parameters to determine customer sentiment.

Steps and tasks: [Total Score: 30 Marks]

Import the required Libraries

1. Import and analyse the data set. [5 Marks]

Loading the data set

Only the num_words most frequent words are kept. As per the project requirement, the 10,000 most frequent words are used here.

The count of unique words changes based on the vocabulary size that is set.
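The effect of num_words can be illustrated without Keras: any index at or above the cutoff is mapped to the unknown-word marker, which is 0 in this project's convention. A minimal sketch (the encoded reviews are made-up toy data):

```python
# Keep only the num_words most frequent words; rarer words become 0 ("unknown").
# Word indexes are frequency ranks, so "rare" simply means index >= num_words.
def cap_vocabulary(reviews, num_words):
    return [[idx if idx < num_words else 0 for idx in review]
            for review in reviews]

# Toy encoded reviews (indexes are frequency ranks).
reviews = [[1, 14, 22, 9999, 12000], [4, 2, 10500, 7]]
capped = cap_vocabulary(reviews, num_words=10000)
print(capped)  # [[1, 14, 22, 9999, 0], [4, 2, 0, 7]]
```

Keras does this filtering internally when num_words is passed to imdb.load_data.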

2. Perform relevant sequence padding on the data. [5 Marks]

pad_sequences: This function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps).
num_timesteps is either the maxlen argument, if provided, or the length of the longest sequence in the list.
Sequences that are shorter than num_timesteps are padded with the padding value until they are num_timesteps long.
Sequences longer than num_timesteps are truncated so that they fit the desired length.
The position where padding or truncation happens is determined by the padding and truncating arguments, respectively. Pre-padding, and removing values from the beginning of the sequence, is the default.
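The default pre-padding and pre-truncating behaviour described above can be mimicked in plain Python (keras.preprocessing.sequence.pad_sequences does the same thing, returning a NumPy array):

```python
# Minimal re-implementation of the pre-padding / pre-truncating defaults
# described above, for illustration only.
def pad_sequences_sketch(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        if len(seq) >= maxlen:
            # Pre-truncate: drop values from the beginning, keep the end.
            padded.append(seq[-maxlen:])
        else:
            # Pre-pad: prepend the padding value until length == maxlen.
            padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_sequences_sketch([[1, 2, 3], [4, 5, 6, 7, 8, 9]], maxlen=4))
# [[0, 1, 2, 3], [6, 7, 8, 9]]
```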

3. Perform the following data analysis: [5 Marks]

3.A.Print shape of features and labels

Comments:
The distribution of the target label (sentiment) is as follows.

3.B. Print the value of any one feature and its label

The randomly chosen sample shown here is a positive review.

4. Decode the feature value to get the original sentence [5 Marks]

Retrieve a dictionary that maps words to their indexes in the IMDB dataset.

Now use the dictionary to recover the original words from the encodings for a particular sentence.

Comments: The original sentence is retrieved by decoding.
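The mechanics of the decoding can be sketched in plain Python. In the Keras IMDB encoding, the indexes returned by load_data are offset by 3 from the word_index, because 0, 1 and 2 are reserved for padding, start-of-sequence and unknown tokens. The word_index below is a tiny made-up stand-in for imdb.get_word_index():

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank.
word_index = {"the": 1, "movie": 17, "was": 9, "great": 84}

# Invert the mapping, applying the +3 offset used by the Keras IMDB encoding
# (0 = padding, 1 = start-of-sequence, 2 = unknown).
reverse_index = {idx + 3: word for word, idx in word_index.items()}

def decode_review(encoded):
    # Reserved or out-of-vocabulary indexes fall back to "?".
    return " ".join(reverse_index.get(idx, "?") for idx in encoded)

print(decode_review([1, 4, 20, 12, 87]))  # "? the movie was great"
```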

5. Design, train, tune and test a sequential model. [5 Marks]

Fit the model

Evaluate the model

Hint: The aim here is to import the text and process it in such a way that it can be taken as an input to the ML/NN classifiers. Be analytical and experimental here in trying new approaches to design the best model.
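One reasonable architecture, assuming the project's vocabulary size of 10,000 and sequence length of 20; the layer sizes and dropout rate are illustrative choices to experiment with, not the only possibility:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 10000   # project requirement: 10,000 most frequent words
maxlen = 20          # first 20 words of each review

model = models.Sequential([
    layers.Input(shape=(maxlen,)),
    layers.Embedding(vocab_size, 64),       # learned word vectors
    layers.LSTM(64),                        # sequence encoder
    layers.Dropout(0.3),                    # regularisation
    layers.Dense(1, activation="sigmoid"),  # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
# Training would then be, e.g.:
# model.fit(x_train, y_train, validation_split=0.2, epochs=5, batch_size=128)
```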

6. Use the designed model to print the prediction on any one sample. [5 Marks]

Conclusion:

PART B

• DOMAIN: Social media analytics

• CONTEXT: Past studies in sarcasm detection mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. Furthermore, many tweets are replies to other tweets, and detecting sarcasm in these requires the availability of contextual tweets. In this hands-on project, the goal is to build a model to detect whether a sentence is sarcastic or not, using Bidirectional LSTMs.

• DATA DESCRIPTION:

The dataset is collected from two news websites, theonion.com and huffingtonpost.com. This new dataset has the following advantages over the existing Twitter datasets: Since news headlines are written by professionals in a formal manner, there are no spelling mistakes or informal usage. This reduces sparsity and also increases the chance of finding pre-trained embeddings. Furthermore, since the sole purpose of TheOnion is to publish sarcastic news, we get high-quality labels with much less noise compared to Twitter datasets. Unlike tweets that reply to other tweets, the news headlines obtained are self-contained, which helps in teasing apart the real sarcastic elements.

Content: Each record consists of three attributes:
is_sarcastic: 1 if the record is sarcastic otherwise 0
headline: the headline of the news article
article_link: link to the original news article, useful in collecting supplementary data

Reference: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection

PROJECT OBJECTIVE: Build a sequential NLP classifier that uses input text parameters to determine whether a news headline is sarcastic.

1. Read and explore the data [3 Marks]

To get the parent link from the Article Link

Printing random headlines and their is_sarcastic values

Grouping the article links by website
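One way to recover the parent site from article_link is the standard-library urllib; the second-to-last label of the host name is the registered site name, which also handles theonion's sub-URLs. The example URLs below are made up:

```python
from urllib.parse import urlparse

def parent_site(article_link):
    # netloc is e.g. "www.huffingtonpost.com" or "local.theonion.com";
    # the second-to-last dot-separated label is the registered site name.
    return urlparse(article_link).netloc.split(".")[-2]

print(parent_site("https://www.huffingtonpost.com/entry/some-headline"))  # huffingtonpost
print(parent_site("https://local.theonion.com/some-sarcastic-headline"))  # theonion
```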

Observations:
• There are 28619 records with 3 features. The headline is the independent feature and is_sarcastic is the dependent feature.
• There are 14985 records (52.4%) with is_sarcastic = 0 and 13634 records (47.6%) with is_sarcastic = 1.
• The data is almost balanced.
• The headlines are from two websites, "theonion" and "huffingtonpost". "theonion" has many sub-URLs.
• From the analysis made above, it is clear that all the sarcastic headlines are from the "theonion" website, while "huffingtonpost" has the non-sarcastic headlines.

2. Retain relevant columns [3 Marks]

Comments:
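With pandas, retaining only the modelling columns is a one-liner; the toy frame below stands in for the real dataset:

```python
import pandas as pd

# Toy stand-in for the sarcasm dataset.
df = pd.DataFrame({
    "article_link": ["https://www.huffingtonpost.com/a", "https://www.theonion.com/b"],
    "headline": ["former versace store clerk sues", "boehner just wants wife to listen"],
    "is_sarcastic": [0, 1],
})

# article_link is only useful for collecting supplementary data,
# so keep just the text and the label.
df = df[["headline", "is_sarcastic"]]
print(df.columns.tolist())  # ['headline', 'is_sarcastic']
```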

3. Get length of each sentence [3 Marks]

To find the number of unique words in the headlines

Steps to clean up the headlines

WordCloud of headlines that are not sarcastic

WordCloud of headlines that are sarcastic

Comments

The record at index 7302 has the maximum number of words (107), and it has been printed above.
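Computing the length of each headline and locating the longest one only needs plain Python (the example headlines are drawn from the style of the dataset, not actual records):

```python
# Length of each headline in words, plus the index of the longest one.
headlines = [
    "former versace store clerk sues over secret black code",
    "the fascinating case for eating lab grown meat",
    "mom starting to fear son's web series closest thing she will have to grandchild",
]
lengths = [len(h.split()) for h in headlines]
longest = max(range(len(headlines)), key=lambda i: lengths[i])
print(lengths)   # [9, 8, 14]
print(longest)   # 2
```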

4. Define parameters [3 Marks]
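Typical parameter choices for this task look like the following; the exact values are assumptions and should be tuned, except that the embedding dimension must match the GloVe vectors used later:

```python
# Modelling parameters (illustrative values, not mandated by the project).
vocab_size = 10000      # maximum number of distinct words kept
max_length = 25         # pad/truncate every headline to this many tokens
embedding_dim = 100     # must match the GloVe vector dimension used later
trunc_type = "post"     # truncate at the end of the headline
padding_type = "post"   # pad at the end of the headline
oov_token = "<OOV>"     # placeholder for out-of-vocabulary words
```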

5. Get indices for words [3 Marks]

Apply the tokenizer to the final headlines
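Keras's Tokenizer assigns indexes by word frequency (1 = most frequent; 0 is reserved for padding). The same idea can be sketched in plain Python with a frequency counter:

```python
from collections import Counter

headlines = ["the cat sat", "the cat ran", "the dog ran home"]

# Count word frequencies across all headlines.
counts = Counter(word for h in headlines for word in h.split())

# Index 1 = most frequent word, as in Keras's Tokenizer (0 is reserved
# for padding). Ties keep first-encountered order, as Counter guarantees.
word_index = {word: i + 1
              for i, (word, _) in enumerate(counts.most_common())}

# Encode each headline as a sequence of word indexes.
sequences = [[word_index[w] for w in h.split()] for h in headlines]
print(word_index["the"])  # 1
print(sequences[0])
```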

6. Create features and labels [3 Marks]

Comments:
The final headlines are the features, and the is_sarcastic column provides the labels.

7. Get vocabulary size [3 Marks]

8.Create a weight matrix using GloVe embeddings [3 Marks]

Matrix for word Embedding
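The weight matrix maps each tokenizer index to its GloVe vector; rows for words missing from GloVe stay zero. A sketch with a toy 4-dimensional "GloVe" dictionary standing in for vectors parsed from a real glove.*.txt file (real GloVe vectors are 50- to 300-dimensional):

```python
import numpy as np

# Toy stand-in for vectors parsed from a GloVe text file.
glove = {
    "the": np.array([0.1, 0.2, 0.3, 0.4]),
    "cat": np.array([0.5, 0.1, 0.0, 0.2]),
}
word_index = {"the": 1, "cat": 2, "zorbl": 3}   # from the tokenizer
embedding_dim = 4
vocab_size = len(word_index) + 1                # +1 for the padding index 0

# Row i of the matrix is the GloVe vector of the word with index i.
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_index.items():
    vector = glove.get(word)
    if vector is not None:                      # words absent from GloVe keep a zero row
        embedding_matrix[idx] = vector

print(embedding_matrix.shape)  # (4, 4)
```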

9. Define and compile a Bidirectional LSTM model. [3 Marks]

Bi-Directional Long Short-Term Memory: A bidirectional LSTM (Bi-LSTM) lets a neural network use sequence information in both directions, backwards (future to past) and forwards (past to future).

In a bidirectional layer the input flows in two directions, which makes a Bi-LSTM different from a regular LSTM. A regular LSTM processes the input in one direction only, either backwards or forwards; a bidirectional one processes it in both directions, preserving both past and future information. For a better explanation, consider an example.

Given only the sentence "boys go to …", we cannot fill in the blank. But given the following sentence, "boys come out of school", we can easily predict the missing word. This is exactly what we want our model to do, and a bidirectional LSTM allows the network to do it.
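A minimal Keras definition matching this description, using a frozen GloVe embedding layer; the parameter values and the zero matrix below are placeholders for the real parameters and GloVe weight matrix built in the earlier steps, and the layer sizes are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embedding_dim, max_length = 10000, 100, 25    # assumed parameters
embedding_matrix = np.zeros((vocab_size, embedding_dim))  # stand-in for the GloVe matrix

model = models.Sequential([
    layers.Input(shape=(max_length,)),
    layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,                      # keep the GloVe vectors frozen
    ),
    layers.Bidirectional(layers.LSTM(64)),    # reads the headline in both directions
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),    # sarcastic vs. not sarcastic
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The Bidirectional wrapper concatenates the forward and backward LSTM outputs, so the layer emits a 128-dimensional encoding before the final sigmoid.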

Conclusion: